Introduction

Airport executives and state and local officials want to understand customer satisfaction at San Francisco International Airport (SFO). Business and tourist travelers are a major source of revenue for the city. SFO is in a prime location relative to other West Coast cities, and it is well situated as a layover point for international travel to Asia. Because of this, identifying the airport’s current strengths and areas for improvement is critical to increasing traffic and revenue.

Data & EDA

Marketing executives developed a survey and administered it to customers over the previous year. SFO invested a lot of time and resources in collecting comprehensive data on 100 variables for 3,234 customers. They have now hired us as their data science consultants to gain insights into these data. The SFO team is not well-versed in data science methodology, but they have a few key areas of interest for us to look into.

Let’s start with some basic EDA and aggregate statistics to get a better feel for the data. We can plot the number of flights by terminal and destination category, and we can also look at the busiest gates in each terminal.

Part A

The SFO team has three (3) specific questions they want us to investigate.

Question 1

Customers were asked to rate their opinion of the “SFO Airport as a whole” on a scale from 1 (“unacceptable”) to 5 (“outstanding”). The executives want to know if there are patterns across the satisfied or dissatisfied customers based on demographic characteristics, such as sex, age group, and income level.

Hypothesis

We theorize that distinct patterns will emerge in the demographic data, but we are as yet unsure what they will show.

Analysis and Results

For the first question, we plan to start by visualizing the relationship between overall satisfaction and the various demographic groups. We should be able to develop initial observations that we can test further with a regression model. The regression model should then indicate whether the patterns suggested by the visuals are statistically supported.

The first step is to convert the columns for age, income, and sex into something easier to read, and then look at satisfaction ratings (Question 6n) grouped by various attributes (gender, income, age, etc.).
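The recoding step can be sketched with a single `factor()` call instead of nested `ifelse()` calls. This is only an illustration: the `q17` vector below is a made-up sample, and the code-to-label mapping is copied from the Q17 recoding in the appendix.

```r
# Hypothetical sample of raw Q17 (age) codes; the real column lives in sfoDf.
q17 <- c(3, 1, 7, 0, 5, 8, 2)

# Code-to-label mapping taken from the appendix recoding of Q17 (codes 0-8).
age_labels <- c("Blank", "Under 18", "18-24", "25-34", "35-44",
                "45-54", "55-64", "65 and Over", "Don't Know/Refused")

# factor() maps every code to its label in one step and keeps the
# levels in code order, which is convenient for plots.
age_group <- factor(q17, levels = 0:8, labels = age_labels)

print(table(age_group))
```

Any code outside 0-8 would become `NA` here, which makes unexpected values easy to spot.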

We can see that the most common response for “How does SFO Airport rate as a whole” was a 4, which is one step below the highest score. Worth noting here is that while 6 is an option, it represents “Have never used or visited/not applicable” so we consider a 5 the highest score. The largest number of respondents who gave SFO a 4 appear to have income within $50,000 - $100,000. This could be because they liked it best, or because they are the largest respondent group. We also see that male respondents had a higher proportion of ‘3’ responses than ‘5’, the opposite being true for females. We will want to test this theory later in a regression model.
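One way to separate "liked it best" from "largest respondent group" is to compare within-group shares rather than raw counts. The sketch below uses invented counts purely to show the idea; none of the numbers come from the survey.

```r
# Hypothetical toy responses: the counts are made up for illustration only.
toy <- data.frame(
  Income = rep(c("Under $50,000", "$50,000-$100,000"), times = c(10, 40)),
  Score  = c(rep(4, 6), rep(5, 4),     # smaller group: 6 of 10 gave a 4
             rep(4, 20), rep(3, 20))   # larger group: 20 of 40 gave a 4
)

counts <- table(toy$Income, toy$Score)

# margin = 1 divides each row by its own total, giving the share of each
# score within an income group.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

In this toy case the larger group contributes more 4s in absolute terms, yet the smaller group gives a 4 at a higher rate, which is the distinction raw counts hide.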

Now let’s take a look at the responses grouped by Income and Age.

We see a similar pattern in this data, with 4 being the most common response for all income levels. Though we have a general idea of the distribution, we want to run a regression model to see if we see any kind of pattern.

## 
## Call:
## lm(formula = Score ~ Gender + Age_Group + Income, data = satisfaction_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0818 -0.2010  0.0401  0.2066  2.2599 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.44719    0.22431  19.826  < 2e-16 ***
## GenderMale              -0.07264    0.02886  -2.517   0.0119 *  
## Age_Group18-24          -0.24620    0.22768  -1.081   0.2796    
## Age_Group25-34          -0.38981    0.22469  -1.735   0.0829 .  
## Age_Group35-44          -0.34613    0.22553  -1.535   0.1250    
## Age_Group45-54          -0.29279    0.22550  -1.298   0.1943    
## Age_Group55-64          -0.21121    0.22594  -0.935   0.3500    
## Age_Group65 and Over    -0.22295    0.22742  -0.980   0.3270    
## Income$50,000-$100,00   -0.12186    0.03957  -3.079   0.0021 ** 
## Income$100,001-$150,000 -0.18440    0.04538  -4.063 5.00e-05 ***
## IncomeOver $150,000     -0.28834    0.04443  -6.490 1.04e-10 ***
## IncomeOther             -1.20098    0.70572  -1.702   0.0889 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.704 on 2430 degrees of freedom
## Multiple R-squared:  0.03452,    Adjusted R-squared:  0.03015 
## F-statistic: 7.899 on 11 and 2430 DF,  p-value: 1.249e-13

Based on this regression, only four of the demographic coefficients are statistically significant at the 0.05 level: GenderMale, Income$50,000-$100,00, Income$100,001-$150,000, and IncomeOver $150,000. Each of these coefficients is negative, meaning that the group scores lower on average than its reference group. Note that the model explains only about 3.5% of the variance in scores (multiple R-squared of 0.03452), so these effects are small.
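To make the coefficients concrete, the worked example below adds up the estimates for one respondent profile. The coefficients are copied from the output above; the choice of profile is ours, and we assume the omitted reference levels are Female, Under 18, and Under $50,000 (the gender reference level in particular is not shown in the output).

```r
# Coefficients copied from the regression output above.
coefs <- c(intercept = 4.44719, male = -0.07264,
           age_25_34 = -0.38981, income_over_150k = -0.28834)

# Baseline prediction: the reference levels (assumed to be Female,
# Under 18, income Under $50,000).
baseline <- coefs[["intercept"]]

# Illustrative profile: male, aged 25-34, income over $150,000.
# A linear model simply adds the three offsets to the intercept.
profile <- baseline + coefs[["male"]] + coefs[["age_25_34"]] +
  coefs[["income_over_150k"]]

print(round(c(baseline = baseline, profile = profile), 3))
```

The predicted score for this profile is roughly three quarters of a point below the baseline, which is noticeable on a 5-point scale even though the overall fit is weak.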

Discussion

Based on our demographic analysis, gender and income are the only demographic characteristics with a statistically significant relationship to the overall rating, and in both cases the relationship is negative. What this tells us is that, in general, men tend to have somewhat more negative opinions of the airport. Also, the overall opinion of SFO seems to decline as a person’s income increases. None of the age group coefficients is significant at the 0.05 level.

Question 2

The executives also want to know if customer satisfaction can be broken down into different attributes of the airport. Knowing this will help the team target specific strengths or areas of improvement. The central feature of the customer satisfaction survey is a 14-question portion asking customers to rate satisfaction with different aspects of the airport (see Question 6 in the data directory). The executives want you to perform a quantitative analysis to determine if there are broad themes that emerge from this part of the survey.

Hypothesis/Hypotheses

Because the questions fall into distinct categories, we theorize that there are distinct underlying factors that influence overall satisfaction. What we do not yet know is which areas of the airport are the most and least important to overall satisfaction.

Analysis and Results

Factor Analysis

For the second question, we feel that the best approach will be factor analysis. This should allow us to see if there are commonalities in the responses, and from there determine which questions appear to be the “most important,” or at least most indicative of positive overall satisfaction.

We will need to look at the responses to all parts of Question 6:

How does SFO rate on each of the following attributes?

  • 6a. Artwork and exhibitions
  • 6b. Restaurants
  • 6c. Retail shops and concessions
  • 6d. Signs and directions inside SFO
  • 6e. Escalators/elevators/moving walkways
  • 6f. Information on screens/monitors
  • 6g. Information booths (lower level near baggage claim)
  • 6h. Information booths (upper level – departure area)
  • 6i. Signs and directions on SFO airport roadways
  • 6j. Airport parking facilities
  • 6k. AirTrain
  • 6l. Long term parking lot shuttle
  • 6m. Airport rental car center
  • 6n. SFO Airport as a whole

Let’s take a look at a correlation plot to see if any patterns jump out.

Immediately we see that the questions about the information booths (6G and 6H) are almost perfectly positively correlated, likely meaning that respondents perceive information booth quality as the same regardless of location. We also see a strong positive correlation among many of the transportation and parking responses (6J, 6K, 6L, 6M).

What we see in the correlation plot leads us to think there may be some themes here, so we will do a more thorough factor analysis to see if this holds true. To begin, we will see how many factors will be best to use.

From the data above, we see that either 3 or 4 factors will likely be best. We will try both to see if there is a meaningful difference.
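One of the simpler criteria behind a "how many factors" decision is the eigenvalue-greater-than-one rule applied to the item correlation matrix. The sketch below shows that rule on simulated data; the data, item names, and noise level are all hypothetical, and this is not a reproduction of the `nfactors()` output used in the report.

```r
set.seed(1847)
n <- 500

# Hypothetical data: two independent latent traits, three noisy items each.
f1 <- rnorm(n)
f2 <- rnorm(n)
items <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6), a2 = f1 + rnorm(n, sd = 0.6),
  a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6), b2 = f2 + rnorm(n, sd = 0.6),
  b3 = f2 + rnorm(n, sd = 0.6)
)

# Eigenvalues of the correlation matrix; they always sum to the item count.
eig <- eigen(cor(items))$values

# Kaiser rule: keep as many factors as there are eigenvalues above 1.
n_keep <- sum(eig > 1)

print(round(eig, 2))
print(n_keep)
```

The rule is a rough guide only; when criteria disagree, as with the 3 versus 4 factors here, fitting both solutions and comparing their interpretability is a reasonable approach.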

## 
## Loadings:
##     MR1    MR2    MR3   
## Q6A  0.336         0.163
## Q6B  0.468  0.130       
## Q6C  0.477  0.116       
## Q6D  0.744 -0.114       
## Q6E  0.692              
## Q6F  0.733              
## Q6G                0.997
## Q6H                0.882
## Q6I  0.116  0.615       
## Q6J         0.719       
## Q6K         0.692       
## Q6L -0.148  0.709       
## Q6M         0.647       
## Q6N  0.797        -0.120
## 
##                  MR1   MR2   MR3
## SS loadings    2.808 2.354 1.833
## Proportion Var 0.201 0.168 0.131
## Cumulative Var 0.201 0.369 0.500

With 3 factors, we see that people who answered positively about artwork, food, shopping, navigational aids, and moving walkways (6A, 6B, 6C, 6D, 6E, 6F) also tended to answer positively for the overall score (6N). One interpretation of factor MR1 is that people are often in a hurry to catch their flight and appreciate being able to move through the airport quickly. Another is that people appreciate the food and shopping.

We also see in factor MR2 the same parking and transportation correlation we saw in the plot. People who answered positively about the parking also did so about airport transportation services (6I, 6J, 6K, 6L, 6M). However, positive responses to these questions did not appear to affect the overall score.

Finally, we see that factor MR3 is the grouping of information booths we saw in the correlation plot. Positive responses about either floor’s information booth tended to match positive responses for the other. However, positive responses to these questions did not appear to affect the overall score.
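The "Proportion Var" and "Cumulative Var" rows in the 3-factor output follow directly from the SS loadings: each is the sum of squared loadings divided by the number of items (14, for Q6A through Q6N). A quick check using the values copied from that output:

```r
# SS loadings copied from the 3-factor output; 14 items (Q6A to Q6N).
ss_loadings <- c(MR1 = 2.808, MR2 = 2.354, MR3 = 1.833)
n_items <- 14

# Share of total item variance attributed to each factor.
prop_var <- ss_loadings / n_items

# Running total across factors.
cum_var <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 3))
```

The three factors together account for about half of the variance in the 14 items, so a good deal of the variation in responses is not captured by these themes.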

Next we will review 4 factors to see if any further patterns emerge.

## 
## Loadings:
##     MR2    MR1    MR3    MR4   
## Q6A                0.139  0.385
## Q6B                       0.872
## Q6C                       0.700
## Q6D         0.815        -0.107
## Q6E         0.623              
## Q6F         0.757              
## Q6G                0.944       
## Q6H                0.914       
## Q6I  0.620  0.114              
## Q6J  0.692                     
## Q6K  0.702                     
## Q6L  0.693 -0.121              
## Q6M  0.646                     
## Q6N         0.643         0.160
## 
##                  MR2   MR1   MR3   MR4
## SS loadings    2.259 2.094 1.780 1.450
## Proportion Var 0.161 0.150 0.127 0.104
## Cumulative Var 0.161 0.311 0.438 0.542

We see mostly the same results here, but now the shopping, food, and navigational information responses are in different factors, with navigation and moving walkways (MR1) having a larger positive correlation with a positive overall score. Food and shopping (MR4) also seems to be related to a positive overall score, but to a lesser extent.

We see the same results for both factors MR2 (transportation and parking) and MR3 (information booths).

With 4 factors, it becomes a bit clearer what most respondents seem to favor: clear navigational aids and moving walkways (MR1). While shopping is also important, the priority for travelers seems to be the ease and speed of navigating within the airport.

Latent Class Analysis

First, we drop the NAs and then recode 0 as 6 so that the response categories are coded properly for the LCA. In other words, we treat a blank score as equivalent to a “not applicable” score. We tried 3, 4, and 5 classes. The five-class model had the lowest BIC of the three, so we chose five classes for our analysis.
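The preparation and model comparison described above might look roughly like the sketch below. The authors’ actual poLCA call is not shown in this section, so everything here is an assumption: the toy data, the four-item subset, and the `nrep` and `calc.se` settings are ours. To our understanding, poLCA expects manifest variables coded as positive integers, which is one practical reason to move 0 to 6.

```r
set.seed(1847)

# Hypothetical stand-in for a few Question 6 items: codes 0-5 plus some NAs.
vals <- sample(c(0:5, NA), 600 * 4, replace = TRUE,
               prob = c(0.05, rep(0.18, 5), 0.05))
toy <- as.data.frame(matrix(vals, ncol = 4))
names(toy) <- c("Q6A", "Q6B", "Q6C", "Q6N")

# Step 1: drop incomplete rows. Step 2: treat blank (0) as "not applicable" (6).
lca_input <- na.omit(toy)
lca_input[lca_input == 0] <- 6L

# Step 3: fit 3-, 4-, and 5-class models and compare BIC (lower is better).
# Guarded so the preparation steps still run where poLCA is not installed.
if (requireNamespace("poLCA", quietly = TRUE)) {
  f <- cbind(Q6A, Q6B, Q6C, Q6N) ~ 1
  bic <- sapply(3:5, function(k) {
    poLCA::poLCA(f, lca_input, nclass = k, nrep = 2,
                 verbose = FALSE, calc.se = FALSE)$bic
  })
  names(bic) <- paste0(3:5, " classes")
  print(bic)
}
```

Because the toy data are random, the BIC values it produces carry no meaning; only the workflow is illustrated.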

Approximately 7.3% of the population falls into class 1, approximately 24% of the population falls within class 2, approximately 19% of the population falls within class 3, approximately 29% of the population falls within class 4, and approximately 21% of the population falls within class 5. Generally, class 1 votes a 5 on the satisfaction with SFO, class 2 votes a 4 on the satisfaction with SFO, class 3 votes a 3 on the satisfaction with SFO, class 4 generally votes a 3 or a 6 on the scale, and class 5 generally votes a 5 or a 6 on the satisfaction scale.

Discussion

From our LCA, we can see that 24% of the respondents are likely to rate their satisfaction with SFO at a 4. Another 7.3% are likely to respond with a 5, and 19% are likely to give a 3. Since 19% of responses are likely to be a 3, there is definite room for improvement at SFO.

Overall, we have found that though there seem to be four total “themes”, there are only two “themes” that have a noticeable relationship to overall satisfaction:

  • 1 - Food services and shopping
  • 2 - Navigational signs and moving walkways

Of those, it would seem that the navigational signs and moving walkways have a higher correlation with overall satisfaction. From this, we conclude that people are often in a hurry and prioritize the ability to get to their gate quickly and easily. Also somewhat important are the available food services and shops.

Question 3

Free-response comments, either good or bad, were collected in addition to the 14-item quantitative survey. The executives are not quite sure how to examine it without going through individual surveys one by one, but they want you to see if there are any concepts or insights that arise from these responses. Do the free responses relate to the findings in a) or b) at all?

Hypothesis/Hypotheses

We strongly believe that the free-response comments will align with the overall satisfaction scores. If the satisfaction score is not impressive, the comments from the travelers should reflect an overall negative sentiment or dissatisfaction.

Analysis and Results

For the third question, we believe that sentiment analysis will be the best way to gather insight from free-form responses. We should easily be able to determine if the overall sentiment matches the overall numerical responses.

For sentiment analysis we have a look at the combined comment column in the given dataset. The comments section does contain some special characters; we need to remove those before we can perform the analysis. Let’s generate a word cloud from all the comments we have.

Analysis by Age Group

Now let’s count positive and negative words for each age group and calculate the sentiment.

Let’s plot the positive and negative sentiment counts for each age group.

We can see that across all age ranges the negative sentiment is slightly higher than the positive sentiment.

Now, instead of the Bing lexicon, we can use AFINN, which assigns a positive or negative value to each word, and calculate the average sentiment for each age group.

Here, too, the average sentiment is negative for almost every age range; the exception is the population under 18.
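The average-sentiment calculation can be illustrated on a tiny example. The lexicon and word counts below are invented for illustration (the real AFINN list is far larger and its scores may differ); the point is that each word’s value is weighted by how often the word occurs before dividing by the total word count.

```r
# Hypothetical mini-lexicon in the AFINN style (word -> integer score).
lexicon <- data.frame(word  = c("confusing", "rude", "clean", "helpful"),
                      value = c(-2, -2, 2, 2))

# Hypothetical word counts per age group.
word_counts <- data.frame(
  Age_Group = c("25-34", "25-34", "25-34", "Under 18", "Under 18"),
  word      = c("confusing", "rude", "clean", "helpful", "confusing"),
  count     = c(4, 1, 2, 3, 1)
)

# Keep only words that appear in the lexicon, as an inner join would.
scored <- merge(word_counts, lexicon, by = "word")

# Weight each word's score by its frequency.
scored$weighted <- scored$value * scored$count

# Sum weighted scores and counts within each age group, then average.
totals <- aggregate(cbind(weighted, count) ~ Age_Group, data = scored, FUN = sum)
totals$Avg_Sentiment <- totals$weighted / totals$count

print(totals)
```

In this toy data the 25-34 group comes out negative and the under-18 group positive, mirroring the direction of the pattern described above without reproducing its values.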

If we perform a similar analysis for each gender and each income group, we see similar results.

Analysis by Gender

Analysis by Income Group

Discussion

The sentiment analysis of the free-response comments shows that the dominant sentiment is negative irrespective of age, gender, and income group. This may be because people with a negative experience are usually the ones who fill in the comments, but there are definitely areas for improvement. The overall picture of sentiment is consistent with the room for improvement we saw in the quantitative results from the previous questions.

Part B

The SFO executives feel that additional insights can be gained from the customer satisfaction survey dataset. Based on your prior EDA deliverable and the topics we have discussed in class, develop an additional research question and execute a plan to evaluate it with these data using a method we covered this semester. Provide an appropriate explanation of your method of choice and how it applies to your question. If formal hypotheses are tested, clearly explain the results of these tests. If the method is more descriptive or data-driven, define how the results are evaluated, and provide sufficient output and data visuals to communicate the outcome. You don’t need to fish for a “significant” finding here; even null or unexpected results can be useful if the hypothesis is reasonable.

Research Question

It is clear that the SFO airport needs to address certain areas to achieve higher customer satisfaction. We suspect there are multiple “themes” that are latent in the free customer responses. We can try topic models to find out the themes.

Analysis and Results

From the previous word cloud, we can see the word “airport” in a lot of the entries. We don’t want this to skew our topics down the line, so let’s remove it by adding the word ‘airport’ to the stopword list. In addition to custom stopwords, we’d also like to include the SMART words again (the default) and the English stopwords.
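The effect of a custom stopword can be seen in a few lines of base R. The comments and the short stopword list below are hypothetical stand-ins; the real analysis would use the full SMART and English lists alongside the custom word.

```r
# Hypothetical comments standing in for the free-response column.
comments <- c("The airport signage is confusing",
              "Airport food is expensive")

# Hypothetical short stopword list plus the custom word "airport".
stopwords_custom <- c("the", "is", "airport")

# Lowercase, split on anything that is not a letter, drop empty strings.
tokens <- unlist(strsplit(tolower(comments), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Remove every token that appears in the stopword list.
kept <- tokens[!tokens %in% stopwords_custom]

print(kept)
```

Only the content-bearing words survive, which is what lets the topic model focus on signage, food, and similar themes instead of the word “airport” itself.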

The stm package has some pretty nice facilities for determining a number of topics. We can try topics from 2 to 5 and see what we get.

## Removing 6 of 188 terms (6 of 12168 tokens) due to frequency 
## Your corpus now has 1557 documents, 182 terms and 12162 tokens.

Looking at the residuals and semantic coherence, we think 4 or 5 topics is the best choice. After looking into both, we decided to stick with 4 topics because the resulting topics looked more realistic. With our 4 topics, we can start our actual model. If we plot the stm result, we get the proportional prevalence of the topics, with some keywords for each topic.

Additional functions can help to understand the words that fall into each topic, which assist in identifying and labeling the topics.

## Topic 1 Top Words:
##       Highest Prob: comment, hard, find, signag, confus, posit, airlin 
##       FREX: comment, find, signag, posit, general, sfo, area 
##       Lift: area, correct, neg, pos, signag, water, comment 
##       Score: comment, find, electr, sfo, signag, area, posit 
## Topic 2 Top Words:
##       Highest Prob: termin, inform, difficult, small, chang, amen, personnel 
##       FREX: inform, chang, amen, display, lack, rapid, screen 
##       Lift: amen, atm, autom, booth, chang, check, clock 
##       Score: inform, outlet, rapid, screen, small, display, lack 
## Topic 3 Top Words:
##       Highest Prob: secur, long, ineffici, custom, line, ineffect, procedur 
##       FREX: secur, ineffici, custom, line, ineffect, procedur, seat 
##       Lift: allow, crowd, custom, cut, effici, front, healthier 
##       Score: custom, ineffici, line, ineffect, procedur, secur, healthier 
## Topic 4 Top Words:
##       Highest Prob: uniqu, restaur, food, expens, shop, free, servic 
##       FREX: uniqu, restaur, expens, shop, access, wifi, cover 
##       Lift: luggag, uniqu, access, air, bad, chain, club 
##       Score: restaur, uniqu, expens, cover, entir, overload, didn

From the plot of the topics and from the example words, we can conclude the following:

  • Topic 1 is likely about the directions and signs in the SFO airport.
  • Topic 2 is likely about the small displays and the lack of information displayed.
  • Topic 3 is likely about customs and security inefficiency and long waiting times.
  • Topic 4 is likely about restaurant availability and food choices.

We can look at statements that have a high probability of being associated with each topic here. This presents documents that are representative of each topic.

## 
##  Topic 1: 
##        General cleanliness pos comment  Bart ground transportation Pos comment  Signage inside airport confusing small hard to find gate or airline 
##  Topic 2: 
##        Information screens Too small not enough lack information displays change too rapidly  Information booth not staffed personnel unhelpful not knowledgeable  More humans fewer electronic automated signs 
##  Topic 3: 
##        Security Customs lines procedures long inefficient ineffective  Security Customs personnel inefficient rude not well trained  Don’t appreciate people cutting in line being allowed in front of me 
##  Topic 4: 
##        Need fast food chain restaurants  Wifi not free long enough difficult to access doesn’t cover entire airport overloaded didn’t know was available

We can see that the statements match the interpretation we gave for each topic above.

Discussion

As we suspected from our factor analysis and LCA, we found multiple latent themes in the customer comments. After careful analysis, we suggest that SFO concentrate on 4 areas to improve customer satisfaction.

  • The first is directions and signs. SFO may need to put up more directional signs, perhaps in other languages too, to help foreign travelers.
  • The second is to modernize the display boards and increase their size.
  • The third is to address the long lines and wait times at customs and security clearance.
  • The last is to revamp the restaurant and food choices and include more options for the diverse travelers SFO serves.

Appendix

#Import the libraries
library(tidyverse)
library(haven)
library(plotly)
library(reshape2)
library(lmtest)
library(lme4)
library(psych)
library(GPArotation)
library(ggcorrplot)
library(tidytext)
library(wordcloud2)
library(sentimentr) 
library(lexicon)
library(magrittr)
library(tidyr)
library(stm)
library(DT)
library(poLCA)
#sets the seed
set.seed(1847)
#Read the data set
#sfoDf <- read_delim('C:/MSDS/Fall 2020/Behavioral Data Science/Project/SFO_survey_withText.txt', delim="\t")
sfoDf <- read_delim('SFO_survey_withText.txt', delim="\t")
#Print the data
head(sfoDf)
############
## EDA   ##
###########
#Provide basic statistics of the data
summary(sfoDf)
#Count the NA or missing Rows for each columns
sapply(sfoDf, function(x) sum(is.na(x)))
#List only the columns where more than 50% of the rows are NA or missing
unlist(sapply(sfoDf, function(x) { 
  if (sum(is.na(x)) > length(x)* 0.5)  
    return (sum(is.na(x))) 
  }))
#Convert the Sex
sfoDf['Gender'] <- ifelse(sfoDf$Q18 == 1, "Male",
                          ifelse(sfoDf$Q18 == 2, "Female", "Blank"))
#Convert the Age Group column
sfoDf['Age_Group'] <- ifelse(sfoDf$Q17 == 1, "Under 18", 
       ifelse(sfoDf$Q17 == 2, "18-24", 
              ifelse(sfoDf$Q17 == 3, "25-34", 
                     ifelse(sfoDf$Q17 == 4, "35-44", 
                            ifelse(sfoDf$Q17 == 5, "45-54", 
                                   ifelse(sfoDf$Q17 == 6, "55-64", 
                                          ifelse(sfoDf$Q17 == 7, "65 and Over",
                                                ifelse(sfoDf$Q17 == 8, "Don't Know/Refused", 
                                                       ifelse(sfoDf$Q17 == 0, "Blank", "NA")))))))))
#Convert the Income Column
sfoDf['Income'] <- ifelse(sfoDf$Q19 == 1, "Under $50,000", 
                          ifelse(sfoDf$Q19 == 2, "$50,000-$100,00",
                                 ifelse(sfoDf$Q19 == 3, "$100,001-$150,000",
                                        ifelse(sfoDf$Q19 == 4, "Over $150,000",
                                               ifelse(sfoDf$Q19 == 5, "Other", "Blank")))))
#Airline terminal
# 1 Terminal 1
# 2 International Terminal
# 3 Terminal 3
# 0 Unknown
flightByTerm <- sfoDf %>% 
  group_by(TERM) %>%
  summarise(Flight_By_Terminal = n()) %>% 
  rename(Terminal = TERM)
flightByTerm
fig <- plot_ly(
  flightByTerm,
  x = ~as.character(Terminal),
  y = ~Flight_By_Terminal,
  type = "bar"
)
fig <- fig %>% 
  layout(title = " ",
         xaxis = list(title = "Terminal No"),
         yaxis = list(title = "No of Flights By Terminal"))
fig
#Airline type (based on Sampling Plan)
# 1 Major carriers
# 2 Small/International carriers
# 3 New carriers
sfoDf %>% 
  group_by(ATYPE) %>%
  summarise(Flight_By_Carrier = n()) %>% 
  rename(Airline_Type = ATYPE)
# DEST Destination of flight
# 1 Within California
# 2 Out of state
# 3 Out of country
sfoDf %>% 
  group_by(DEST) %>%
  summarise(Flight_By_Destination = n()) %>% 
  rename(Destination = DEST)
flightByTCD <- sfoDf %>% 
  rename(Terminal = TERM, Airline_Type = ATYPE, Destination = DEST) %>% 
  group_by(Terminal, Airline_Type, Destination) %>%
  summarise(Flight_Count = n()) 
flightByTCD$Terminal <-  as.factor(flightByTCD$Terminal)
#Code 2 is the International Terminal (see the TERM codes above)
levels(flightByTCD$Terminal) <- c("Terminal 1", "International Terminal", "Terminal 3")
flightByTCD$Airline_Type <- as.factor(flightByTCD$Airline_Type)
levels(flightByTCD$Airline_Type) <- c("Major carriers", "Small/International carriers", "New carriers")
flightByTCD$Destination <- as.factor(flightByTCD$Destination)
p <- ggplot(flightByTCD, aes(Airline_Type, Flight_Count)) +   
    geom_bar(aes(fill = Destination), 
             position = "dodge", stat = "identity") + 
    labs(title = "Flight Count by Airline Type and Destination for Each Terminal ", 
         x = "\nAirline Type", y = "Flight Count\n", color = "Destination\n") + 
  scale_fill_discrete( name="Destinations",
                       breaks=c("1", "2", "3"),
                       labels=c("Within California","Out of State","Out of Country")) + 
  theme(axis.text.x = element_text(angle = 30, vjust = 0.2, hjust= 0.2))
 
p <- p + facet_wrap( ~ Terminal, ncol=3)
#ggplotly(p)
p
#Busiest 5 gates by Terminal 
top5GtByTerm <- sfoDf %>% 
  group_by(TERM, GATENUM) %>% 
  summarise(Flight_Count = n()) %>% 
  arrange(desc(Flight_Count)) %>% 
  top_n(n=5) %>% 
  arrange(TERM, GATENUM) %>% 
  rename(Terminal = TERM, Gate_Num = GATENUM)
top5GtByTerm
p <- ggplot(top5GtByTerm, aes(Gate_Num, Flight_Count)) + 
  geom_bar(position = "dodge", stat = "identity") + 
  geom_text(aes(label=Flight_Count), vjust=0) + 
  facet_wrap( ~ Terminal, ncol=3)
p
#Cleanliness Comments by Gender/Income
#Use the Age_Group column created above (this script never creates an Age column)
cleanliness_data <- sfoDf %>%
  select(Age_Group, Gender, Income, 'Q8COM1':'Q8COM2') %>% 
  melt(., id.vars=c('Age_Group','Gender','Income'), value.name = 'Comments') %>% 
  select(-'variable') %>% 
  na.omit(.)
ggplot(cleanliness_data, aes(x=as.factor(Comments), fill=Income)) + 
  geom_bar(stat='count') +
  facet_wrap(~Gender) + 
  labs(title="Cleanliness Comments by Gender/Income")
#Cleanliness Comments by Gender/Age
ggplot(cleanliness_data, aes(x=as.factor(Comments), fill=Age_Group)) + 
  geom_bar(stat='count') +
  facet_wrap(~Gender) +
  labs(title="Cleanliness Comments by Gender/Age")
#Cleanliness Comment Count by Age and Gender
ggplot(cleanliness_data, aes(x=Age_Group)) + 
  geom_bar(stat='count') +
  facet_wrap(~Gender) +
  labs(title="Cleanliness Comment Count by Age and Gender")
#Cleanliness Comment Count by Income and Gender
ggplot(cleanliness_data, aes(x=Income)) + 
  geom_bar(stat='count') +
  facet_wrap(~Gender) +
  labs(title="Cleanliness Comment Count by Income and Gender")+ 
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7))
## Code for Part 2 of the Project ##
##########################################
## Satisfaction Score by Age/Income/Sex ##
##########################################
satisfaction_data <- sfoDf %>%
  select('Gender','Age_Group','Income','Q6N') %>% 
  melt(., id.vars=c('Age_Group','Gender','Income'), value.name = 'Score') %>% 
  select(-'variable') %>% 
  na.omit(.)
satisfaction_data$Income <- factor(satisfaction_data$Income,levels = c('Under $50,000',"$50,000-$100,00","$100,001-$150,000","Over $150,000","Other","Blank"))
satisfaction_data$Age_Group <- factor(satisfaction_data$Age_Group, levels = c( "Under 18","18-24","25-34","35-44","45-54","55-64","65 and Over","Don't Know/Refused","Blank", "NA"))
ggplot(satisfaction_data, aes(x=as.factor(Score), fill=Income)) + 
  geom_bar(stat='count') +
  facet_wrap(~Gender) + 
  labs(title="Satisfaction Score by Gender/Income", x='Score', y='Count')
ggplot(satisfaction_data, aes(x=as.factor(Score), fill=Age_Group)) + 
  geom_bar(stat='count') +
  facet_wrap(~Income) +
  labs(title="Satisfaction Score by Income/Age", x='Score', y='Count')
lm_score <- lm(Score ~ Gender + Age_Group + Income,data = satisfaction_data)
summary(lm_score)
short_survey <- sfoDf %>% 
  select(Q6A:Q6N)
short_survey %>% 
  cor(., use="complete.obs") %>% 
  ggcorrplot(type = "lower")
########################
## Factor Analysis   ##
#######################
short_survey %>% 
  nfactors()
surveyFA_3 <- short_survey %>% 
  fa(., nfactors = 3, rotate = "promax")
surveyFA_3$loadings
surveyFA_4 <- short_survey %>% 
  fa(., nfactors = 4, rotate = "promax")
surveyFA_4$loadings
########################
## Sentiment Analysis##
#######################
#Comment Column
sfoDf$Q7_text_All %>% 
  na.omit %>% 
  head(10)
#Remove the Slash
sfoDf$Q7_text_All <- gsub(pattern = "/", 
                          replacement = " ", 
                          sfoDf$Q7_text_All)
#Remove the hyphens
sfoDf$Q7_text_All <- gsub(pattern = "-", 
                          replacement = " ", 
                          sfoDf$Q7_text_All)
#Verify the data
sfoDf$Q7_text_All %>% 
  na.omit %>% 
  head(10)
#Word Count by each Age Group
wordDf <- sfoDf %>% 
  select(Q7_text_All, Q17, Age_Group) %>% 
   rename(Age = Q17,
         Comments = Q7_text_All) %>% 
  unnest_tokens(word, Comments) %>% 
  anti_join(stop_words) %>% 
  na.omit() %>% 
  group_by(Age_Group, Age, word) %>% 
  summarise(count = n())
#Check the data set
head(wordDf, n = 20)
#Count positive and negative words
wordSent_bing <- wordDf %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(Age_Group, Age, sentiment) %>% 
  count(sentiment)  %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
#Print the dataset
head(wordSent_bing, n= 10)
#Generate the Dataset for the Plot
plotDf <- wordDf %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(Age_Group, Age, sentiment) %>% 
  count(sentiment) %>% 
  rename(count = n)
#Generate the Barplot for the Positive/Negative Sentiment
#For Each Age Group
#The y-axis range is set by coord_cartesian() at the end of the plot
ggplot(plotDf, aes(Age_Group)) + 
geom_bar(data = subset(plotDf, sentiment == "positive"), 
   aes(y = count, fill = sentiment), stat = "identity", position = "dodge") +
geom_bar(data = subset(plotDf, sentiment == "negative"), 
   aes(y = -count, fill = sentiment), stat = "identity", position = "dodge") + 
geom_hline(yintercept = 0,colour = "grey90") + 
  #Now add the text labels
geom_text(data = subset(plotDf, sentiment == "positive"), 
      aes(Age_Group, count, group=sentiment, label=count),
        position = position_dodge(width=0.9), vjust = 1.5, size=4) +
geom_text(data = subset(plotDf, sentiment == "negative"), 
      aes(Age_Group, -count, group=sentiment, label=count),
        position = position_dodge(width=0.9), vjust = -.5, size=4) +
    coord_cartesian(ylim = c(-30, 30))
#Calculate Average sentiment for each Age Group
avgSentAgeGrp <- wordDf %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(Age_Group) %>% 
  #Weight each word's AFINN value by how often the word occurs
  summarise( totalWords = sum(count), sentSum = sum(value * count)) %>% 
  mutate(Avg_Sentiment = sentSum / totalWords)
#Print the dataset
head(avgSentAgeGrp, n=20)
#Word Count by each Gender
wordDf <- sfoDf %>% 
  select(Gender,  Q7_text_All) %>% 
   rename(Comments = Q7_text_All) %>% 
  unnest_tokens(word, Comments) %>% 
  anti_join(stop_words) %>% 
  na.omit() %>% 
  group_by(Gender, word) %>% 
  summarise(count = n())
#Check the data set
head(wordDf, n = 20)
#Count positive and negative words
wordSent_bing <- wordDf %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(Gender, sentiment) %>% 
  count(sentiment)  %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
#Print the dataset
head(wordSent_bing, n= 10)
#Generate the Dataset for the Plot
plotDf <- wordDf %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(Gender, sentiment) %>% 
  count(sentiment) %>% 
  rename(count = n)
#Generate the Barplot for the Positive/Negative Sentiment
#For Each Gender
#The y-axis range is set by coord_cartesian() at the end of the plot
ggplot(plotDf, aes(Gender)) + 
geom_bar(data = subset(plotDf, sentiment == "positive"), 
   aes(y = count, fill = sentiment), stat = "identity", position = "dodge") +
geom_bar(data = subset(plotDf, sentiment == "negative"), 
   aes(y = -count, fill = sentiment), stat = "identity", position = "dodge") + 
geom_hline(yintercept = 0,colour = "grey90") + 
  #Now add the text labels
geom_text(data = subset(plotDf, sentiment == "positive"), 
      aes(Gender, count, group=sentiment, label=count),
        position = position_dodge(width=0.9), vjust = 1.5, size=4) +
geom_text(data = subset(plotDf, sentiment == "negative"), 
      aes(Gender, -count, group=sentiment, label=count),
        position = position_dodge(width=0.9), vjust = -.5, size=4) +
    coord_cartesian(ylim = c(-20, 20))
#Calculate the average AFINN sentiment per word occurrence for each Gender
#(each word's score is weighted by how often it appears)
avgSentSex <- wordDf %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(Gender) %>% 
  summarise(totalWords = sum(count), sentSum = sum(value * count)) %>% 
  mutate(Avg_Sentiment = sentSum / totalWords)
#Print the dataset
head(avgSentSex, n=20)
#Word Count by each Income Group
wordDf <- sfoDf %>% 
  select(Q7_text_All, Income) %>% 
   rename(Comments = Q7_text_All) %>% 
  unnest_tokens(word, Comments) %>% 
  anti_join(stop_words) %>% 
  na.omit() %>% 
  group_by(Income, word) %>% 
  summarise(count = n())
#Check the data set
head(wordDf, n = 20)
#Count positive and negative words
wordSent_bing <- wordDf %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(Income, sentiment) %>% 
  count(sentiment)  %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
#Print the dataset
head(wordSent_bing, n= 10)
#Generate the Dataset for the Plot
plotDf <- wordDf %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(Income, sentiment) %>% 
  count(sentiment) %>% 
  rename(count = n)
#Generate the Barplot for the Positive/Negative Sentiment
#For Each Income Group
ggplot(plotDf, aes(Income)) + 
geom_bar(data = subset(plotDf, sentiment == "positive"), 
   aes(y = count, fill = sentiment), stat = "identity", position = "dodge") +
geom_bar(data = subset(plotDf, sentiment == "negative"), 
   aes(y = -count, fill = sentiment), stat = "identity", position = "dodge") + 
geom_hline(yintercept = 0,colour = "grey90") + 
  geom_text(data = subset(plotDf, sentiment == "positive"), 
      aes(Income, count, group=sentiment, label=count),
        position = position_dodge(width=0.9), vjust = 1.5, size=4) +
geom_text(data = subset(plotDf, sentiment == "negative"), 
      aes(Income, -count, group=sentiment, label=count),
        position = position_dodge(width=0.9), vjust = -.5, size=4) +
    coord_cartesian(ylim = c(-20, 20))
#Calculate the average AFINN sentiment per word occurrence for each Income Group
#(each word's score is weighted by how often it appears)
avgSentIncome <- wordDf %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(Income) %>% 
  summarise(totalWords = sum(count), sentSum = sum(value * count)) %>% 
  mutate(Avg_Sentiment = sentSum / totalWords)
#Print the dataset
head(avgSentIncome, n=20)
############################
## Latent Class Analysis  ##
###########################
#Select the Question 6 items (Q6A through Q6N) and drop incomplete responses
sfoDfq2 <- sfoDf %>% dplyr::select(Q6A:Q6N) %>% drop_na()
#poLCA needs positive integer categories, so recode 0 to 6 in every item
sfoDfq2[sfoDfq2 == 0] <- 6
lcaFormula = cbind(Q6A, Q6B, Q6C, Q6D, Q6E, Q6F, Q6G, Q6H, Q6I, Q6J, Q6K, Q6L, Q6M, Q6N) ~ 1
#Fit the 5-class model on the recoded data
lcaAllqsClasses = poLCA(lcaFormula, sfoDfq2, nclass = 5, maxiter = 10000, verbose = FALSE)
#Plot the LCA
plot(lcaAllqsClasses)
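#A minimal sketch, not part of the original analysis: one common way to check
#the choice of nclass is to fit several candidate models and compare BIC
#(lower is better). The 2:6 range and the name lcaBic are assumptions made
#for illustration only.
lcaBic <- sapply(2:6, function(k) {
  poLCA(lcaFormula, sfoDfq2, nclass = k, maxiter = 10000, verbose = FALSE)$bic
})
names(lcaBic) <- 2:6
lcaBic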
########################
## Topic Models      ##
#######################
feedbackText = textProcessor(documents = sfoDf$Q7_text_All, 
                           metadata = sfoDf)
rvest::guess_encoding(sfoDf$Q7_text_All)
#Convert the comments to all uppercase
sfoDf$Q7_text_All <- toupper(sfoDf$Q7_text_All)
#Count the comments that mention "AIRPORT"
nrow(sfoDf[grepl("AIRPORT", sfoDf$Q7_text_All), ])
feedbackTextProcess = textProcessor(documents = sfoDf$Q7_text_All, 
                           metadata = sfoDf, 
                           onlycharacter = TRUE,
                           customstopwords = c("airport", 
                                               tm::stopwords("SMART"), 
                                               tm::stopwords("en")))
feedbackTextPrep = prepDocuments(documents = feedbackTextProcess$documents, 
                               vocab = feedbackTextProcess$vocab,
                               meta = feedbackTextProcess$meta)
#Use the vocab returned by prepDocuments so it matches the prepared documents
kTest = searchK(documents = feedbackTextPrep$documents, 
             vocab = feedbackTextPrep$vocab, 
             K = c(2, 3, 4, 5), verbose = FALSE)
plot(kTest)
topics4 = stm(documents = feedbackTextPrep$documents, 
             vocab = feedbackTextPrep$vocab, 
             K = 4, verbose = FALSE)
plot(topics4)
labelTopics(topics4)
findThoughts(topics4, texts = feedbackTextPrep$meta$Q7_text_All , n = 1)
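#A minimal sketch, not part of the original analysis: the fitted stm object
#stores the document-topic proportions in theta, so averaging its columns
#gives a rough share of the comments attributed to each topic. The name
#topicShare is hypothetical.
topicShare <- colMeans(topics4$theta)
names(topicShare) <- paste0("Topic", seq_along(topicShare))
round(topicShare, 3)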